Abstract

We derive generalization error bounds (bounds on the expected inaccuracy of the predictions) for
traditional time series forecasting models. Our results hold for many standard forecasting tools including 
autoregressive models, moving average models, and, more generally, linear state-space models. These 
bounds allow forecasters to select among competing models and to guarantee that with high probability, 
their chosen model will perform well without making strong assumptions about the data generating 
process or appealing to asymptotic theory. We motivate our techniques with and apply them to standard 
economic and financial forecasting tools — a GARCH model for predicting equity volatility and a dynamic 
stochastic general equilibrium model (DSGE), the standard tool in macroeconomic forecasting. We 
demonstrate in particular how our techniques can aid forecasters and policy makers in choosing models 
which behave well under uncertainty and mis-specification. 

Keywords: Generalization error, Prediction risk, Model selection.

1 Introduction 

Generalization error bounds are probabilistically valid, non-asymptotic tools for characterizing the predic- 
tive ability of forecasting models. This methodology is fundamentally about choosing particular prediction 
functions out of some class of plausible alternatives so that, with high reliability, the resulting predictions 
will be nearly as accurate as possible ( "probably approximately correct" ) . While many of these results are 
useful only for classification problems (i.e., predicting binary variables) and for independent and identically 
distributed (IID) data, this paper adapts and extends these methods to time series models, so that economic 
and financial forecasting techniques can be evaluated rigorously. In particular, these methods control the 
expected accuracy of future predictions from mis-specified models based on finite samples. This allows for 
immediate model comparisons which neither appeal to asymptotics nor make strong assumptions about the 
data-generating process, in stark contrast to such popular model-selection tools as AIC. 

To fix ideas, imagine IID$^1$ data $((Y_1, X_1), \ldots, (Y_n, X_n))$ with $(Y_i, X_i) \in \mathcal{Y} \times \mathcal{X}$, some prediction function
$f : \mathcal{X} \to \mathcal{Y}$, and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ which measures the cost of bad predictions. The generalization
error or risk of $f$ is

$$R(f) := \mathbb{E}[\ell(Y, f(X))] \qquad (1)$$

where the expectation is taken with respect to $P$, the joint distribution of $(Y, X)$. The generalization error
measures the inaccuracy of our predictions when we use $f$ on future data, making it a natural criterion for

*Email: dajmcdon@indiana.edu, cshalizi@cmu.edu, mark@cmu.edu. This work is partially supported by a grant from the
Institute for New Economic Thinking. CRS was also partially supported by NIH Grant # 2 R01 NS047493. The authors wish
to thank David N. DeJong, Larry Wasserman, Alessandro Rinaldo and Darren Homrighausen for valuable suggestions.

$^1$The IID assumption here is just for ease of exposition; we develop dependent-data results at length below.
model selection, and a target for performance guarantees. To actually calculate the risk, we would need to
know the data-generating distribution $P$ and have a single fixed prediction function $f$, neither of which is
common. Because explicitly calculating the risk is infeasible, forecasters typically try to estimate it, which
calls for detailed assumptions on $P$. The alternative we employ here is to find upper bounds on risk which
hold uniformly over large classes of models $\mathcal{F}$ from which some particular $f$ is chosen, possibly in a
data-dependent way, and uniformly over distributions $P$.

Our main results in Section 4 assert that for wide classes of time series models (including VARs and 
state-space models), the expected cost of poor predictions is bounded by the model's in-sample performance 
inflated by a term which balances the amount of observed data with the complexity of the model. The bound 
holds with high probability under the unknown distribution P assuming only mild conditions — existence of 
some moments, stationarity, and the decay of temporal dependence as data points become widely separated 
in time. As a preview, the following provides the general form of the result. Specific results which have this 
flavor are Theorem 4.3 and Theorem 4.6 and their corollaries. We give applications in Section 5. 

Result. Given a time series $Y_1, \ldots, Y_n$ satisfying some mild conditions and a prediction function $f$ chosen
from a class of functions $\mathcal{F}$ (possibly by using the observed sample), then, with probability at least $1 - \eta$,

$$R(f) \le \widehat{R}_n(f) + C_{\mathcal{F}}(\eta, n) \qquad (2)$$

where $R(f)$ is the expected cost of making prediction errors on new samples, $\widehat{R}_n(f)$ is the average cost of
in-sample prediction errors, and $C_{\mathcal{F}}(\eta, n) \ge 0$ balances the complexity of the model from which $f$ was chosen with
the amount of data used to choose it.

There are many ways to estimate the generalization error, and a comprehensive review is beyond the 
scope of this paper. Traditionally, time series analysts have performed model selection by a combination of 
empirical risk minimization, more-or-less quantitative inspection of the residuals, and penalties like AIC. In 
many applications, however, what really matters is prediction, and none of these techniques work to control 
generalization error, especially for mis-specified models. Empirical cross-validation is a partial exception,
but it is tricky for time series; see Racine [44] and references therein. In economics, forecasters have long
recognized the difficulties with these methods, preferring to use a pseudo-cross validation approach instead:
choose a prediction function using the initial portion of a data set and evaluate its performance on the
remainder (cf. [2, 16, 19, 50]). This procedure provides approximate solutions to the problem of estimating
the generalization error, but it can be biased toward overfitting (giving too much credence to the observed
data) and hence tends to underestimate the true risk for at least three reasons. First, the held-out data, or
test set, is used to evaluate the performance of competing models despite the fact that it was already partially
used to build those models. For instance, the recent housing and financial crises have precipitated attempts
to enrich existing models with mechanisms designed to enhance their ability to predict just such a crisis (cf.
[21-23]). Second, the test set may reflect only a small sampling of possible phenomena which could occur.
Finally, large departures from the normal course of events such as the recessions in 1980-82 and periods 
before 1960 are often ignored, as in [19]. While these periods are considered rare and perhaps unpredictable, 
models which are robust to these sorts of disruptive events will lead to more accurate predictions in future 
times of turmoil. 

In contrast to the model evaluation techniques typically employed in the literature, generalization error 
bounds provide rigorous control over the predictive risk as well as reliable methods of model selection. They 
are robust to wide classes of data generating processes and are finite-sample rather than asymptotic in nature. 
In a broad sense, these methods give confidence intervals which are constructed based on concentration of 
measure results rather than appeals to asymptotic normality. The results are easy to understand and can be 
reported to policy makers interested in the quality of the forecasts. Finally, the results are agnostic about the 
model's specification: it does not matter if the model is wrong, whether the parameters have interpretable 
economic meaning, or whether the estimation of the parameters is performed only approximately (linearized 
DSGEs or MCMC). In all of these cases, we can still make strong claims about the ability of the model to 
predict the future. 

The bounds we derive here are the first of their kind for the time series models typically used in applied 
settings — finance, economics, engineering, etc. — but there are results for other models more common 
to computer science (cf. Meir [37], Mohri and Rostamizadeh [38, 39]). Those results require bounded loss 
functions, making them less general than ours, as well as hinging on specific forms of regularization which are
rarely used in time series. Furthermore, they rely on prediction functions $f : \mathcal{X} \to \mathcal{Y}$ where the dependence
occurs in the $\mathcal{X}$ space. Therefore, these results are extensible to AR models or others which depend on only
the most recent past (assuming appropriate model space constraints are satisfied) but not, for instance, to 
standard state-space models. For another view on this problem, [36] shows that stationarity alone can be 
used to regularize an AR model following the results in [38] , but leads to bounds which are much worse than 
those given here, despite the stricter assumption of bounded loss. 

The meaning of such results for forecasters, or for those whose scientific aims center around prediction 
of empirical phenomena, is plain: they provide objective ways of assessing how good their models really are. 
There are, of course, other uses for scientific models: for explanation, for the evaluation of counterfactuals 
(especially, in economics, comparing the consequences of different policies), and for welfare calculations. Even 
in those cases, however, one must ask why this model rather than another?, and the usual answer is that the 
favored model approximates reality better than the alternative — it gets the structure approximately right. 
Empirical evidence for structural correctness, in turn, usually takes the form of an argument from empirical 
success: it would be very surprising if this model fit the data so well when it got the structure wrong [33]. 
Our results, which directly address the inference from past data-matching to future performance, are thus 
relevant even to those who do not aim at prediction as such. 

The remainder of this paper is structured as follows. Section 2 provides motivation and background for our 
results, giving intuition in the IID setting by focusing on concentration of measure ideas and characterizations 
of model complexity. Section 3 gives the explicit assumptions we make and describes how to leverage powerful 
ideas from time series to generalize the IID methods. Section 4 states and proves risk bounds for the time 
series forecasting setting, while we demonstrate how to use the results in Section 5 and give some properties 
of those results in Section 6. Finally, Section 7 concludes and illustrates the path toward generalizing our 
methods to more elaborate model classes. 



2 Statistical learning theory 

Our goal is to control the risk of predictive models, i.e., their expected inaccuracy on new data from the 
same source as that used to fit the model. To orient readers new to this approach, we sketch how classical 
results in the IID setting are obtained. 

Let $f : \mathcal{X} \to \mathcal{Y}$ be some function used for making predictions of $Y$ from $X$. We define a loss function
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ which measures the cost of making poor predictions. Throughout this paper, we will
assume that $\ell(y, y')$ is a function solely of the difference $y - y'$ where $\ell(\cdot)$ is nonnegative and $\ell(0) = 0$. For
the remainder of the paper, we take the liberty of denoting that function $\ell(y - y')$. Then the risk of any
predictor $f \in \mathcal{F}$ (where $f$ is fixed independently of the data) is given by

$$R(f) = \mathbb{E}[\ell(Y - f(X))], \qquad (3)$$

where $(X, Y) \sim P$. The risk or generalization error is the expected cost of using $f$ to predict $Y$ from $X$ on
a new observation.

Since the true distribution $P$ is unknown, so is $R(f)$, but we can try to estimate it based on our observed
data. The training error or empirical risk of $f$ is

$$\widehat{R}_n(f) := \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i - f(X_i)). \qquad (4)$$

In other words, the in-sample training error, $\widehat{R}_n(f)$, is the average loss over the actual training points.
Because the true risk is an expectation value, we can say that

$$\widehat{R}_n(f) = R(f) + \gamma_n(f), \qquad (5)$$

where $\gamma_n(f)$ is a mean-zero noise variable that reflects how far the training sample departs from being
perfectly representative of the data-generating distribution. By the law of large numbers, for each fixed
$f$, $\gamma_n(f) \to 0$ as $n \to \infty$, so, with enough data, we have a good idea of how well any given function will
generalize to new data.
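As a small numerical illustration of the law-of-large-numbers statement above, the following Python sketch compares the empirical risk of a fixed predictor with its true risk as $n$ grows. The data-generating process, the predictor `fixed_predictor`, and the squared-error loss are all hypothetical choices made for this example; they are not taken from the paper.

```python
import random


def squared_loss(residual):
    """Loss as a function of the prediction error y - f(x); zero at zero."""
    return residual ** 2


def empirical_risk(f, xs, ys, loss=squared_loss):
    """Average loss of a fixed predictor f over the sample."""
    return sum(loss(y - f(x)) for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Hypothetical process: X ~ Uniform(-1, 1), Y = 2X + N(0, 1) noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fixed_predictor(x):
    """A predictor chosen before seeing any data (deliberately mis-specified)."""
    return 1.5 * x


# Y - f(X) = 0.5 X + noise, so the true risk is 0.25 * Var(X) + 1 = 0.25 / 3 + 1.
TRUE_RISK = 0.25 / 3.0 + 1.0

rng = random.Random(0)
gaps = {}
for n in (100, 10_000, 200_000):
    xs, ys = simulate(n, rng)
    gaps[n] = abs(empirical_risk(fixed_predictor, xs, ys) - TRUE_RISK)
    print(n, round(gaps[n], 4))
```

Because the predictor is fixed in advance, the gap is pure sampling noise and shrinks with $n$; the same is not guaranteed once the predictor is chosen using the data.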
However, forecasters rarely have the luxury of a theory which fixes for them, in advance of the data, a
single function $f$, free of adjustable parameters. Rather, there is a class of plausible functions $\mathcal{F}$, possibly
indexed by some parameters $\theta \in \Theta$, which we will call "a model". One picks out a single function (chooses one
particular parameter point) from the model via some method (maximum likelihood, Bayesian updating,
indirect inference, ad hoc methods) which often amounts to minimizing the in-sample loss. In this case,
the result is

$$\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \widehat{R}_n(f) = \operatorname{argmin}_{f \in \mathcal{F}} \left( R(f) + \gamma_n(f) \right). \qquad (6)$$

Tuning the parameters so that $\hat{f}$ fits the training data well thus conflates predicting future data well (low
true risk $R(\hat{f})$) with exploiting the accidents and noise of the training data (large negative finite-sample
noise $\gamma_n(\hat{f})$). The true risk of $\hat{f}$ will generally be bigger than its in-sample risk precisely because we picked
it to match the data well. In doing so, $\hat{f}$ ends up reproducing some of the noise in the data and therefore
will not generalize as well as $\widehat{R}_n(\hat{f})$ suggests. The difference between the true and apparent risk depends on
the magnitude of the sampling fluctuations:

$$R(\hat{f}) - \widehat{R}_n(\hat{f}) \le \sup_{f \in \mathcal{F}} |\gamma_n(f)| = \Gamma_n(\mathcal{F}). \qquad (7)$$

The main goal of statistical learning theory is to mathematically control $\Gamma_n(\mathcal{F})$, finding tight bounds
on this quantity which make weak assumptions about the unknown data-generating process; i.e., to bound
over-fitting. Using more flexible models (allowing more general functional forms or distributions, adding
parameters, etc.) has two contrasting effects. On the one hand, it improves the best possible accuracy,
lowering the minimum of the true risk. On the other hand, it increases the ability to, as it were, memorize
noise for any fixed sample size $n$. This qualitative observation (a form of the bias-variance trade-off from
basic estimation theory) can be made usefully precise by quantifying the complexity of model classes. A
typical result is a confidence bound on $\Gamma_n$ (and hence on the over-fitting), which says that with probability
at least $1 - \eta$,

$$\Gamma_n(\mathcal{F}) \le \Phi(\Psi(\mathcal{F}), n, \eta), \qquad (8)$$

where $\Psi(\cdot)$ measures the complexity of the model $\mathcal{F}$.

To give specific forms of $\Phi(\cdot)$, we need to show that, for a particular $f$, $R(f)$ and $\widehat{R}_n(f)$ will be close to
each other for each fixed $n$, without knowledge of the distribution of the data. We also need to understand
the complexity, $\Psi(\mathcal{F})$, so that we can claim $R(f)$ and $\widehat{R}_n(f)$ will be close uniformly over all $f \in \mathcal{F}$. Together
these two pieces tell us, despite little knowledge of the data generating process, how bad the $\hat{f}$ which we
choose will be at forecasting future observations.



2.1 Concentration 



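The first step to controlling the difference between the empirical and expected risk is to show that for each
$f \in \mathcal{F}$, $R(f) - \widehat{R}_n(f)$ is small with high probability. The following is a standard result (cf. [55] or [12]).

Theorem 2.1. Suppose that $0 \le \ell(y, y') \le K < \infty$. Then for each $f \in \mathcal{F}$,

$$\mathbb{P}\left( |R(f) - \widehat{R}_n(f)| > \epsilon \right) \le 2 \exp\left\{ -\frac{2n\epsilon^2}{K^2} \right\}. \qquad (9)$$

Proof. The proof begins by using an exponential version of Markov's inequality. For a fixed $f$, we have
$\mathbb{E}[\widehat{R}_n(f)] = R(f)$. Therefore

$$\mathbb{P}\left( R(f) - \widehat{R}_n(f) > \epsilon \right) = \mathbb{P}\left( \exp\{s(R(f) - \widehat{R}_n(f))\} > \exp\{s\epsilon\} \right) \qquad (10)$$

$$\le \frac{\mathbb{E}\left[\exp\{s(R(f) - \widehat{R}_n(f))\}\right]}{\exp\{s\epsilon\}}. \qquad (11)$$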
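We can bound the moment generating function, $\mathbb{E}[\exp\{s(R(f) - \widehat{R}_n(f))\}]$, via Hoeffding's inequality [26]:

$$\mathbb{E}\left[\exp\{s(R(f) - \widehat{R}_n(f))\}\right] = \mathbb{E}\left[\exp\left\{\frac{s}{n} \sum_{i=1}^{n} \left[R(f) - \ell(Y_i - f(X_i))\right]\right\}\right] \qquad (12)$$

$$= \prod_{i=1}^{n} \mathbb{E}\left[\exp\left\{\frac{s}{n} \left[R(f) - \ell(Y_i - f(X_i))\right]\right\}\right] \le \prod_{i=1}^{n} \exp\left\{\frac{s^2 K^2}{8n^2}\right\} = \exp\left\{\frac{s^2 K^2}{8n}\right\}. \qquad (13)$$

With this result, we have

$$\mathbb{P}\left( R(f) - \widehat{R}_n(f) > \epsilon \right) \le \exp\{-s\epsilon\} \exp\left\{\frac{s^2 K^2}{8n}\right\}. \qquad (14)$$

This holds for all $s > 0$, so we can minimize the right hand side in $s$ (this is known as Chernoff's method).
The minimum occurs for $s = 4n\epsilon/K^2$. Substitution gives

$$\mathbb{P}\left( R(f) - \widehat{R}_n(f) > \epsilon \right) \le \exp\left\{-\frac{2n\epsilon^2}{K^2}\right\}. \qquad (15)$$

Exactly the same argument holds for $\mathbb{P}(R(f) - \widehat{R}_n(f) < -\epsilon)$, so by a union bound, we have the result. ■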

This result is quite powerful: it says that the probability of observing data which will result in a training 
error much different from the expected risk goes to zero exponentially with the size of training set. The only 
assumption necessary was that $\ell(y - y') \le K$. In fact, even this assumption can be removed and replaced
with some moment assumptions, as will be done for our main results below. 
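The exponential concentration described above can be checked by simulation. The sketch below evaluates the two-sided Hoeffding bound $2\exp\{-2n\epsilon^2/K^2\}$ and compares it with the observed frequency of large deviations. The "losses" here are hypothetical Uniform(0, 1) draws, so $K = 1$ and the expected loss is 0.5; this is an illustrative assumption, not a setting from the paper.

```python
import math
import random


def hoeffding_bound(n, eps, K=1.0):
    """Two-sided Hoeffding bound 2 exp(-2 n eps^2 / K^2), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / K ** 2))


def deviation_frequency(n, eps, trials, rng):
    """Fraction of trials in which the mean of n Uniform(0, 1) 'losses'
    (so K = 1 and the expectation is 0.5) misses 0.5 by more than eps."""
    misses = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            misses += 1
    return misses / trials


rng = random.Random(1)
n, eps = 50, 0.1
observed = deviation_frequency(n, eps, trials=2000, rng=rng)
print(observed, hoeffding_bound(n, eps))
```

The bound is distribution-free, so it is typically far from tight for any particular distribution; the observed frequency should sit well below it.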

Theorem 2.1 holds for the single function $f$, and we want a similar result to hold uniformly over all
functions $f \in \mathcal{F}$, and in particular, any $\hat{f}$ that we might choose using the training data, i.e., we wish to
bound $\mathbb{P}\left(\sup_{f \in \mathcal{F}} |R(f) - \widehat{R}_n(f)| > \epsilon\right)$. How can we achieve this extension?

2.2 Capacity

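For "small" models, we can just count the number of functions in the class and take the union bound.
Suppose that $\mathcal{F} = \{f_1, \ldots, f_N\}$. Then we have

$$\mathbb{P}\left( \sup_{1 \le i \le N} |R(f_i) - \widehat{R}_n(f_i)| > \epsilon \right) \le \sum_{i=1}^{N} \mathbb{P}\left( |R(f_i) - \widehat{R}_n(f_i)| > \epsilon \right) \qquad (16)$$

$$\le 2N \exp\left\{-\frac{2n\epsilon^2}{K^2}\right\}, \qquad (17)$$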

by Theorem 2.1. Most interesting models are not small in this sense, but similar results hold when model 
size is measured appropriately. 
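To make the finite-class bound concrete, the following sketch inverts it to find how many observations are needed for a target accuracy. It assumes each of the $N$ functions obeys the two-sided bound $2\exp\{-2n\epsilon^2/K^2\}$ of Theorem 2.1, so that the union bound is $2N\exp\{-2n\epsilon^2/K^2\}$; the function names and the numerical settings are illustrative only.

```python
import math


def union_bound(N, n, eps, K=1.0):
    """Bound on P(some f_i has |R - R_hat| > eps): 2 N exp(-2 n eps^2 / K^2)."""
    return 2.0 * N * math.exp(-2.0 * n * eps ** 2 / K ** 2)


def sample_size(N, eps, eta, K=1.0):
    """Smallest n for which union_bound(N, n, eps, K) <= eta."""
    return math.ceil(K ** 2 * math.log(2.0 * N / eta) / (2.0 * eps ** 2))


for N in (1, 1000, 10 ** 6):
    print(N, sample_size(N, eps=0.1, eta=0.05))
```

The required sample size grows only logarithmically in $N$, which is why counting functions is affordable for small classes but uninformative for infinite ones.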

There are a number of measures for the size or capacity of a model. Algorithmic stability [4, 5, 28] 
quantifies the sensitivity of the chosen function to small perturbations to the data. Similarly, maximal 
discrepancy [53] asks how different the predictions could be if two functions are chosen using two separate 
data sets. A more direct, functional-analytic approach partitions $\mathcal{F}$ into equivalence classes under some
metric, leading to covering numbers [42, 43]. Rademacher complexity [3] directly describes a model's ability 
to fit random noise. We focus on a measure which is both intuitive and powerful: Vapnik-Chervonenkis 
(VC) dimension [52, 53]. 

VC dimension starts as an idea about collections of sets. 

Definition 2.2. Let $\mathbb{U}$ be some (infinite) set and $S$ a finite subset of $\mathbb{U}$. Let $\mathcal{C}$ be a family of subsets of $\mathbb{U}$.
We say that $\mathcal{C}$ shatters $S$ if for every $S' \subseteq S$, $\exists C \in \mathcal{C}$ such that $S' = S \cap C$.

Essentially, C can shatter a set S if it can pick out every subset of points in S. This says that the 
collection C is very complicated or flexible. The cardinality of the largest set S that can be shattered by C 
is the latter's VC dimension. 
Definition 2.3 (VC dimension). The Vapnik-Chervonenkis (VC) dimension of a collection $\mathcal{C}$ of subsets of
$\mathbb{U}$ is

$$\mathrm{VCD}(\mathcal{C}) := \sup\{|S| : S \subseteq \mathbb{U} \text{ and } S \text{ is shattered by } \mathcal{C}\}. \qquad (18)$$

To see why this is a "dimension", we need one more notion.

Definition 2.4 (Growth function). The growth function $G(n, \mathcal{C})$ of a collection $\mathcal{C}$ of subsets of $\mathbb{U}$ is the
maximum number of subsets which can be formed by intersecting a set $S \subset \mathbb{U}$ of cardinality $n$ with $\mathcal{C}$,

$$G(n, \mathcal{C}) := \sup_{S \subset \mathbb{U} \,:\, |S| = n} \left|\{S \cap C : C \in \mathcal{C}\}\right|. \qquad (19)$$

The growth function counts how many effectively distinct sets the collection contains, when we can only
observe what is going on at $n$ points, not all of $\mathbb{U}$. If $n \le \mathrm{VCD}(\mathcal{C})$, then from the definitions $G(n, \mathcal{C}) = 2^n$. If
the VC dimension is finite, however, and $n > \mathrm{VCD}(\mathcal{C})$, then $G(n, \mathcal{C}) < 2^n$, and in fact it can be shown [54]
that

$$G(n, \mathcal{C}) \le (n + 1)^{\mathrm{VCD}(\mathcal{C})}. \qquad (20)$$

This polynomial growth of capacity with $n$ is why VCD is a "dimension".
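The definitions of shattering and of the growth function can be explored by brute force on a simple family. The sketch below uses closed intervals on the real line, a family whose VC dimension is known to be 2; this family and the helper names are chosen for illustration and do not appear in the paper. For intervals, every set of $n$ distinct points yields the same number of traces, so evaluating on the integers $0, \ldots, n-1$ gives the supremum in the growth function.

```python
def traces(points, sets):
    """Distinct subsets of `points` picked out by a family of sets,
    each set given as a membership predicate."""
    return {frozenset(p for p in points if contains(p)) for contains in sets}


def shatters(points, sets):
    """True if the family picks out every one of the 2^n subsets of `points`."""
    return len(traces(points, sets)) == 2 ** len(points)


def intervals_for(points):
    """Closed intervals [a, b] with half-integer endpoints, enough to realise
    every subset of integer-valued `points` that an interval can pick out
    (a > b gives the empty set)."""
    grid = sorted({p - 0.5 for p in points} | {p + 0.5 for p in points})
    return [(lambda x, a=a, b=b: a <= x <= b) for a in grid for b in grid]


VCD_INTERVALS = 2  # known VC dimension of intervals on the real line

for n in range(1, 7):
    pts = list(range(n))
    count = len(traces(pts, intervals_for(pts)))
    # columns: n, growth function, 2^n, polynomial bound (n + 1)^VCD
    print(n, count, 2 ** n, (n + 1) ** VCD_INTERVALS)
```

Up to $n = 2$ the count equals $2^n$; beyond that it grows quadratically and stays below $(n+1)^2$, in line with (20).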

Using VC dimension to measure the capacity of function classes is straightforward. Define the indicator
function $\mathbf{1}_A(x)$ to take the value 1 if $x \in A$ and 0 otherwise. Suppose that $f \in \mathcal{F}$, $f : \mathbb{U} \to \mathbb{R}$. Each $f$
corresponds to the set

$$C_f = \{(u, b) : \mathbf{1}_{(0,\infty)}(f(u) - b) = 1,\ u \in \mathbb{U},\ b \in \mathbb{R}\}, \qquad (21)$$

so $\mathcal{F}$ corresponds to the class $\mathcal{C}_{\mathcal{F}} := \{C_f : f \in \mathcal{F}\}$, and we write $\mathrm{VCD}(\mathcal{F})$ for $\mathrm{VCD}(\mathcal{C}_{\mathcal{F}})$. Essentially, the
growth function $G(n, \mathcal{C}_{\mathcal{F}})$ counts the effective number of functions in $\mathcal{F}$, i.e., how many can be told apart
using only $n$ observations. When $\mathrm{VCD}(\mathcal{F}) < \infty$, this number grows only polynomially with $n$. This observation
lets us control the risk over the entire model, providing one of the pillars of statistical learning theory.

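Theorem 2.5 (Vapnik and Chervonenkis [54]). Suppose that $\mathrm{VCD}(\mathcal{F}) < \infty$ and $0 \le \ell(y, y') \le K < \infty$.
Then,

$$\mathbb{P}\left( \sup_{f \in \mathcal{F}} |R(f) - \widehat{R}_n(f)| > \epsilon \right) \le 4 (2n + 1)^{\mathrm{VCD}(\mathcal{F})} \exp\left\{-\frac{n\epsilon^2}{K_1^2}\right\}, \qquad (22)$$

where $K_1$ depends only on $K$ and not $n$ or $\mathcal{F}$.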

The proof of this theorem has a similar flavor to the union bound argument given in (17). 

This theorem has as an immediate corollary a bound for the out-of-sample risk. Since $\sup_{f \in \mathcal{F}}$ is inside
the probability statement in (22), it applies to both pre-specified and to data-dependent functions, including
any $\hat{f}$ chosen by fitting a model or minimizing empirical risk.

Corollary 2.6. When Theorem 2.5 applies, for any $\eta > 0$ and any $f \in \mathcal{F}$, with probability at least $1 - \eta$,

$$R(f) \le \widehat{R}_n(f) + K_1 \sqrt{\frac{\mathrm{VCD}(\mathcal{F}) \log(2n + 1) + \log(4/\eta)}{n}}. \qquad (23)$$

The factor $K_1$ can be calculated explicitly but is unilluminating and we will not need it. Conceptually,
the right-hand side of this inequality resembles standard model selection criteria, like AIC or BIC, with
in-sample fit plus a penalty term which goes to zero as $n \to \infty$. Here however, the bound holds with high
probability despite lack of knowledge of $P$ and it has nothing to do with asymptotic convergence: it holds
for each $n$. It does however hold only with high $P$ probability, not always.
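The way such a penalty term arises, and how it behaves, can be sketched numerically. The code below assumes a tail bound of the generic form $4(2n+1)^{\mathrm{VCD}}\exp\{-n\epsilon^2/c\}$ and solves for the $\epsilon$ that makes it equal to $\eta$. The constant `c` is a hypothetical stand-in for the unspecified constant in the theorem, so the absolute numbers are illustrative; only the dependence on $n$, the VC dimension and $\eta$ is meaningful.

```python
import math


def vc_tail(eps, n, vcd, c=1.0):
    """Right side of a VC-type tail bound: 4 (2n + 1)^vcd exp(-n eps^2 / c)."""
    return 4.0 * (2 * n + 1) ** vcd * math.exp(-n * eps ** 2 / c)


def vc_penalty(n, vcd, eta, c=1.0):
    """The eps at which vc_tail equals eta: the additive complexity penalty."""
    return math.sqrt(c * (vcd * math.log(2 * n + 1) + math.log(4.0 / eta)) / n)


for n in (100, 1_000, 10_000, 100_000):
    # penalty for a simpler (vcd = 5) and a richer (vcd = 20) model class
    print(n, round(vc_penalty(n, 5, 0.05), 3), round(vc_penalty(n, 20, 0.05), 3))
```

Under this form the penalty shrinks like $\sqrt{\log n / n}$ and grows with the VC dimension, so a richer class must fit noticeably better in-sample to be preferred.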

VC dimension is well understood for some function classes. For instance, if $\mathcal{F} = \{x \mapsto \gamma \cdot x : \gamma \in \mathbb{R}^p\}$
then $\mathrm{VCD}(\mathcal{F}) = p + 1$, i.e. it is the number of free parameters in a linear regression plus 1. VC dimension
does not always have such a nice relation to the number of free parameters however; the classic example
is the model $\mathcal{F} = \{x \mapsto \sin(\omega x) : \omega \in \mathbb{R}\}$, which has only one free parameter, but $\mathrm{VCD}(\mathcal{F}) = \infty$.$^2$ At the

$^2$This result follows if we can show that for any positive integer $J$ and any binary sequence $(r_1, \ldots, r_J)$, there
exists a vector $(x_1, \ldots, x_J)$ such that $\mathbf{1}_{[0,1]}(\sin(\omega x_i)) = r_i$. If we choose $x_i = 2\pi 10^{-i}$, then one can show that taking
$\omega = \frac{1}{2}\left(\sum_{i=1}^{J} (1 - r_i) 10^i + 1\right)$ solves the system of equations.
same time, there are model classes (support vector machines) which may have infinitely many parameters
but finite VC dimension [11]. This illustrates a further difference between the statistical learning approach 
and the usual information criteria, which are based on parameter-counting. 
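The claim that the one-parameter sine family has infinite VC dimension can be checked numerically for a small number of points. The sketch below uses the standard construction with points $x_i = 2\pi 10^{-i}$ and frequency $\omega = \frac{1}{2}\left(\sum_{i=1}^{J}(1 - r_i)10^i + 1\right)$ for a target labeling $(r_1, \ldots, r_J)$, and verifies that every one of the $2^J$ labelings is reproduced. The choice $J = 5$ is an assumption made to keep floating-point error negligible.

```python
import math
from itertools import product


def sine_labels(omega, xs):
    """Label each x by whether sin(omega * x) lies in [0, 1]."""
    return tuple(1 if math.sin(omega * x) >= 0.0 else 0 for x in xs)


def omega_for(labels):
    """Frequency omega = (sum_i (1 - r_i) 10^i + 1) / 2 for labels r_1..r_J."""
    total = sum((1 - r) * 10 ** i for i, r in enumerate(labels, start=1))
    return 0.5 * (total + 1)


J = 5  # kept small so floating-point error in omega * x stays negligible
xs = [2.0 * math.pi * 10.0 ** (-i) for i in range(1, J + 1)]
shattered = all(
    sine_labels(omega_for(r), xs) == r for r in product((0, 1), repeat=J)
)
print(shattered)
```

A single real parameter therefore suffices to reproduce any labeling of these points, which is why parameter counting alone cannot measure capacity.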

The concentration results in Theorem 2.1 and Theorem 2.5 work well for independent data. The first
shows how quickly averages concentrate around their expectations: exponentially fast in the size of the data. 
The second result generalizes the first from a single function to entire function classes. Both results, as 
stated, depend critically on the independence of the random variables. For time series, we must be able to 
handle dependent data. In particular, because time-series data are dependent, the length n of a sample path 
$Y_1, \ldots, Y_n$ exaggerates how much information it contains. Knowing the past allows forecasters to predict
future data (at least to some degree), so actually observing those future data points gives less information 
about the underlying process than in the IID case. Thus, while in Theorem 2.1 the probability of large 
discrepancies between empirical means and their expectations decreases exponentially in n, in the dependent 
case, the effective sample size may be much less than n resulting in looser bounds. 

3 Time series 

In moving from the IID setting to time series forecasting, we need a number of modifications to our initial 
setup. Rather than observing input/output pairs $(Y_i, X_i)$, we observe a single sequence of random variables
$Y_{1:n} := (Y_1, \ldots, Y_n)$ where each $Y_i$ takes values in $\mathbb{R}^p$.$^3$ We are interested in using functions which take past
observations as inputs and predict future values of the process. Specifically, given data from time 1 to time 
n, we wish to predict time n + 1. 

While we no longer presume IID data, we still need to restrict the sort of dependent process we work 
with. We first remind the reader of the notion of (strict or strong) stationarity. 

Definition 3.1 (Stationarity). A random sequence $Y_\infty$ is stationary when all its finite-dimensional
distributions are time-invariant: for all $t$ and all non-negative integers $i$ and $j$, the random vectors $Y_{t:t+i}$ and
$Y_{t+j:t+i+j}$ have the same distribution.

Stationarity does not imply that the random variables $Y_t$ are independent across time $t$, only that the
unconditional distribution of $Y_t$ is constant in time. We limit ourselves not just to stationary processes, but
also to ones in which widely-separated observations are asymptotically independent. Without this restriction,
convergence of the training error to the expected risk could occur arbitrarily slowly, and finite-sample bounds
may not exist.$^4$ The next definition describes the sort of serial dependence which we entertain.

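Definition 3.2 ($\beta$-Mixing). Consider a stationary random sequence $Y_\infty$ defined on a probability space
$(\Omega, \Sigma, \mathbb{P}_\infty)$. Let $\sigma_{i:j} = \sigma(Y_{i:j})$ be the $\sigma$-field of events generated by the appropriate collection of random
variables. Let $\mathbb{P}_0$ be the restriction of $\mathbb{P}_\infty$ to $\sigma_{-\infty:0}$, $\mathbb{P}_a$ be the restriction of $\mathbb{P}_\infty$ to $\sigma_{a:\infty}$, and $\mathbb{P}_{0 \otimes a}$ be the
restriction of $\mathbb{P}_\infty$ to $\sigma(Y_{-\infty:0}, Y_{a:\infty})$. The coefficient of absolute regularity, or $\beta$-mixing coefficient, $\beta_a$, is
given by

$$\beta_a := \|\mathbb{P}_0 \times \mathbb{P}_a - \mathbb{P}_{0 \otimes a}\|_{TV}, \qquad (24)$$

where $\|\cdot\|_{TV}$ is the total variation norm. A stochastic process is absolutely regular, or $\beta$-mixing, if $\beta_a \to 0$
as $a \to \infty$.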

This is only one of many equivalent characterizations of $\beta$-mixing (see Bradley [6] for others). This
definition makes clear that a process is $\beta$-mixing if the joint probability of events which are widely separated
in time approaches the product of the individual probabilities, i.e., that $Y_\infty$ is asymptotically independent.
Many common time series models are known to be $\beta$-mixing, and the rates of decay are known up to constant
factors which are functions of the true parameters of the process. Among the processes for which such results
are known are ARMA models [40], GARCH models [7], and certain Markov processes; see Doukhan [17]
for an overview. Additionally, functions of $\beta$-mixing processes are $\beta$-mixing, so if $\mathbb{P}_\infty$ could be specified by
a dynamic factor model or DSGE or VAR, the observed data would satisfy this condition.
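For a concrete feel for how mixing coefficients decay, the sketch below computes them for a small stationary Markov chain. It relies on a standard identity that is not stated in the text: for a stationary Markov chain with transition matrix $P$ and stationary distribution $\pi$, $\beta_a = \sum_i \pi_i \, \mathrm{TV}(P^a(i, \cdot), \pi)$. The total variation distance is taken here as half the $L_1$ distance, which is an assumption about the normalization, and the two-state chain is hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def beta_mixing(P, pi, a):
    """beta_a = sum_i pi_i * TV(P^a(i, .), pi) for a stationary finite chain,
    with TV taken as half the L1 distance between probability vectors (a >= 1)."""
    Pa = P
    for _ in range(a - 1):
        Pa = mat_mul(Pa, P)
    return sum(
        pi_i * 0.5 * sum(abs(p - q) for p, q in zip(row, pi))
        for pi_i, row in zip(pi, Pa)
    )


# Hypothetical two-state chain: switch 0 -> 1 w.p. p, and 1 -> 0 w.p. q.
p, q = 0.2, 0.3
P = [[1 - p, p], [q, 1 - q]]
pi = [q / (p + q), p / (p + q)]  # stationary distribution

for a in (1, 2, 5, 10):
    print(a, beta_mixing(P, pi, a))
```

For this chain the coefficients decay geometrically at rate $|1 - p - q|$, the second eigenvalue of $P$, illustrating the kind of known decay rate referred to above.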

$^3$We can easily generalize this to arbitrary measurable spaces.

$^4$In fact, Adams and Nobel [1] demonstrate that for ergodic processes, finite VC dimension is enough to give consistency,
but not rates.
Knowing $\beta_a$ would let us determine the effective sample size of the time series $Y_{1:n}$. In effect, having $n$
dependent-but-mixing data points is like having $\mu < n$ independent ones. Once we determine the correct
$\mu$, we can (as we will now show) use concentration results for IID data like those in Theorem 2.1 and
Theorem 2.5 with small corrections.

4 Risk bounds 

With the relevant background in place, we can put the pieces together to derive our results. We use $\beta$-mixing
to find out how much information is in the data and VC dimension to measure the capacity of the
state-space model's prediction functions. The result is a bound on the generalization error of the chosen function
$\hat{f}$. After slightly modifying the definition of "risk" to fit the time-series forecasting scenario, and stating
necessary technical assumptions, we derive risk bounds for wide classes of economic forecasting models. 

4.1 Setup and assumptions 

We observe a finite subsequence of random vectors $Y_{1:n}$ from a process $Y_\infty$ defined on a probability space
$(\Omega, \Sigma, \mathbb{P}_\infty)$, with $Y_i \in \mathbb{R}^p$. We make the following assumption on the process.

Assumption A. $\mathbb{P}_\infty$ is a stationary, $\beta$-mixing process with mixing coefficients$^5$ $\beta_a$, $\forall a > 0$.

Under stationarity, the marginal distribution of $Y_t$ is the same for all $t$. We deal mainly with the joint
distribution of $Y_{1:n+1}$, where we observe the first $n$ observations and try predicting $Y_{n+1}$. For the rest of
this paper, we will call this joint distribution $P$. Our results extend to predicting more than one step ahead,
but the notation becomes cumbersome.

We must define generalization error and training error slightly differently for time series than in the IID
setting. Using the same notion of loss functions as before, we consider prediction functions $f : \mathbb{R}^{n \times p} \to \mathbb{R}^p$.

Definition 4.1 (Time series risk).

$$R_n(f) := \mathbb{E}\left[\ell(Y_{n+1} - f(Y_{1:n}))\right]. \qquad (25)$$

The expectation is taken with respect to the joint distribution $P$ and therefore depends on $n$. The
function $f$ may use some or all of the past to generate predictions. A function using only the most recent $d$
observations as inputs will be said to have fixed memory of length $d$. Other functions have growing memory,
i.e., $f$ may use all the previous data to predict the next data point. This incongruity makes the notation for
time series training error somewhat problematic.

We will define the training error with a subscript $i \in \mathbb{N}$ on $f$ within the summation. Strictly speaking,
there is only one function $f$ which we are using to make forecasts. In typical fixed memory settings
(standard VAR forecasting models and so on) $f_i = f_j = f$ for all $i, j \in \mathbb{N}$. But for models with growing
memory, a fixed forecasting method (an ARMA model, DSGE,$^6$ or linear state-space model) will use
all of the past to make predictions, so the dimension of the domain changes with $i$. We write the risk of $f$
as a single function, because, once we parameterize a forecasting method, an entire sequence of forecasting
functions $f_1, f_2, \ldots$ is determined.

Definition 4.2 (Time series training error).

$$\widehat{R}_n(f) := \frac{1}{n - d} \sum_{i=d}^{n-1} \ell(Y_{i+1} - f_i(Y_{1:i})). \qquad (26)$$

$^5$In order to apply the results, one must either know $\beta_a$ for some $a$ or be able to estimate it with sufficient precision and
accuracy. McDonald et al. [34] show how to estimate the mixing coefficients non-parametrically, based on a single sample from
the process.

$^6$A DSGE is a nonlinear system of expectational difference equations, so estimating the parameters is nontrivial. Likelihood
methods typically work by finding a linear approximation using Taylor expansions and the Kalman filter, though increasingly
complex nonlinear methods are now intensely studied. See for instance DeJong and Dave [13], Fernández-Villaverde [20] or
DeJong et al. [15].
In order to make use of this single definition of training error, we let $d \ge 0$. In fixed memory cases
(say an AR(2)) $d$ has an obvious meaning, while with growing memory, $d = 0$ is allowed.
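For the fixed-memory case, the training error is just an average of one-step-ahead losses, which the following sketch computes. It averages over the $n - d$ forecasts that can actually be formed from the sample; that normalization, the squared-error loss, and the two example forecasters (`last_value` and `linear_trend`) are assumptions made for illustration.

```python
def squared_loss(residual):
    """Loss as a function of the forecast error; zero at zero."""
    return residual ** 2


def training_error(y, predict, d, loss=squared_loss):
    """Average one-step-ahead loss of a fixed-memory forecaster.

    y        observed series y[0], ..., y[n-1]  (Y_1, ..., Y_n in the text)
    predict  maps the d most recent values (oldest first) to a forecast
    d        memory length, with 1 <= d < n in this sketch
    """
    n = len(y)
    if not 1 <= d < n:
        raise ValueError("need 1 <= d < n")
    losses = [loss(y[i] - predict(y[i - d:i])) for i in range(d, n)]
    return sum(losses) / len(losses)


def last_value(window):
    """Memory-1 forecaster: predict that the next value equals the last one."""
    return window[-1]


def linear_trend(window):
    """Memory-2 forecaster with AR(2)-style coefficients (2, -1)."""
    return 2.0 * window[-1] - window[-2]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(training_error(series, last_value, d=1))
print(training_error(series, linear_trend, d=2))
```

On this linear series the memory-2 forecaster has zero training error while the memory-1 forecaster is always off by one, so the two values differ even though both are computed from the same sample.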

To control the generalization error for time series forecasting, we make one final assumption, about the 
possible magnitude of the losses. Specifically, we weaken the bounded loss assumption we used in §2 to allow 
for unbounded loss as long as we retain some control on moments of the loss. 

Assumption B. Assume that for all $f \in \mathcal{F}$,

$$Q_n(f) := \sqrt{\mathbb{E}\left[\ell(Y_{n+1} - f(Y_{1:n}))^2\right]} \le M < \infty. \qquad (27)$$



Assumption B is still quite general, allowing even some heavy tailed distributions. 



4.2 Fixed memory 

We can now state our results giving finite sample risk bounds for the problem of time series forecasting. We 
begin with the fixed memory setting; the next section will allow the memory length to grow. 

Theorem 4.3. Suppose that Assumption A and Assumption B hold, that the model class $\mathcal{F}$ has a fixed
memory length $d < n$, and that we have a sample $Y_{1:n}$. Let $\mu$ and $a$ be integers such that $2\mu a + d \le n$. Then,
for all $\epsilon > 0$,

$$\mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \widehat{R}_n(f)}{Q_n(f)} > \epsilon \right) \le 8 (2\mu + 1)^{\mathrm{VCD}(\mathcal{F})} \exp\{[\ldots]\} + 2\mu\beta_{a-d}, \qquad (28)$$

where $W(\cdot)$ is the Lambert W function.

The implications of this theorem are considerable. Given a finite sample of length $n$, we can say that
with high probability, future prediction errors will not be much larger than our observed training errors. It
makes no difference whether the model is correctly specified. This stands in stark contrast to model selection
tools like AIC or BIC which appeal to asymptotics. Moreover, given a model class $\mathcal{F}$, we can say exactly
how much data we need to have good control of the prediction risk. As the effective data size increases, the
training error is a better and better estimate of the generalization error, uniformly over all of $\mathcal{F}$.

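The Lambert W function in the exponential term deserves some explanation. The Lambert W function
is defined as the inverse of $f(w) = w \exp w$ (cf. Corless et al. [9]). A strictly, but only slightly, worse bound
can be achieved by noting that

\[ \exp\left\{ -\frac{\mu \exp\left( W\left( -\frac{2\epsilon^2}{e^4} \right) + 4 \right)}{4} \right\} \le \exp\left\{ -\frac{\mu\epsilon^{8/3}}{4^{5/3}} \right\} \tag{29} \]

for all $\epsilon \in [0,1]$.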

The difference between expected and empirical risk is only interesting when $R_n(f)$ exceeds $\hat{R}_n(f)$. Due
to the supremum, events where the training error exceeds the expected risk are irrelevant. Therefore, we
are only concerned with $0 \le \hat{R}_n(f) \le R_n(f)$. Of course, as discussed in Section 2, for most estimation
procedures, $\hat{f}$ is chosen to make $\hat{R}_n(f)$ as small as possible.

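One way to understand this theorem is to visualize the tradeoff between confidence $\epsilon$ and effective data
$\mu$. Consider, by way of illustration, what happens when $\mathrm{vcd}(\mathcal{F}) = 1$, $\beta_a = 0$, and $M = 1$. Then (28) and
(29) become

\[ \mathbb{P}\left( \sup_{f \in \mathcal{F}} R_n(f) - \hat{R}_n(f) > \epsilon \right) \le 8 \exp\left\{ \log(2\mu + 1) - \frac{\mu\epsilon^{8/3}}{4^{5/3}} \right\}. \tag{30} \]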

Our goal is to minimize $\epsilon$, thereby ensuring that the relative difference between the expected risk and the
training risk is small. At the same time we want to minimize the right side of the bound so that the
probability of "bad" outcomes (samples where the difference in risks exceeds $\epsilon$) is small. Of course
we want to do this with as little data as possible, but the smaller we take $\epsilon$, the larger we must take $\mu$ to
compensate. We depict this tradeoff in Figure 1.

Figure 1: Visualizing the tradeoff between confidence ($\epsilon$, y-axis) and effective data ($\mu$, x-axis). The black
curve indicates the region where the bound becomes trivial. Below this line, the probability is bounded by
1. Darker colors indicate lower probability of the "bad" event, namely that the difference in risks exceeds $\epsilon$. The
colors correspond to the natural logarithm of the bound on this probability.

The figure is structured so that movement toward the origin is preferable. We have tighter control on
the difference in risks with less data. But moving in that direction leads to an increased probability of the
bad event, namely that the difference in risks exceeds $\epsilon$. The bound becomes trivial below the solid black line
(the bad event occurs with probability no larger than one). The desire for the bad event to occur with low
probability forces the decision boundary to the upper right.

Another way to interpret the plot is as a set of indifference curves. Anywhere in the same color region is
equally desirable in the sense that the probability of equally bad events is the same. So if we had a budget
constraint trading $\epsilon$ and data (i.e. a line with negative slope), we could optimize within the budget set to
find the lowest probability allowable.
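
As a rough numerical companion to Figure 1, the following Python sketch tabulates the natural logarithm of the simplified bound. It assumes the right-hand side takes the form $8\exp\{\log(2\mu+1) - \mu\epsilon^{8/3}/4^{5/3}\}$ (the setting with VC dimension 1, no mixing correction, and $M = 1$ discussed above); the function names and the grid of values are illustrative choices, not taken from the paper.

```python
import math


def log_bound(mu, eps):
    """Natural log of 8 * exp(log(2*mu + 1) - mu * eps**(8/3) / 4**(5/3))."""
    return (math.log(8.0) + math.log(2 * mu + 1)
            - mu * eps ** (8.0 / 3.0) / 4.0 ** (5.0 / 3.0))


def is_trivial(mu, eps):
    """The bound says nothing when it is at least 1, i.e. its log is >= 0."""
    return log_bound(mu, eps) >= 0.0


if __name__ == "__main__":
    # Rows: effective sample size mu. Columns: tolerance eps.
    for mu in (10, 100, 1000, 10000):
        row = ["{:9.2f}".format(log_bound(mu, eps)) for eps in (0.1, 0.25, 0.5, 1.0)]
        print("{:6d}".format(mu), " ".join(row))
```

Negative entries correspond to the informative region of the figure; non-negative entries fall in the region where the bound is trivial.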

Before we prove Theorem 4.3, we will state a corollary which puts the same result in a form that is 
sometimes easier to use. 

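Corollary 4.4. Under the conditions of Theorem 4.3, for any $f \in \mathcal{F}$, the following bound holds with
probability at least $1 - \eta$, for all $\eta > 2\mu\beta_{a-d}$:

\[ R_n(f) \le \hat{R}_n(f) + M\sqrt{\frac{\mathcal{E}(4 - \log\mathcal{E})}{2}}, \tag{31} \]

with

\[ \mathcal{E} = \frac{4\,\mathrm{vcd}(\mathcal{F})\log(2\mu + 1) + \log(8/\eta')}{\mu} \tag{32} \]

and $\eta' = \eta - 2\mu\beta_{a-d}$.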
We now prove both Theorem 4.3 and Corollary 4.4 to provide the reader with some intuition for the types
of arguments necessary. We defer proof of the remainder of the theorems in this section to the appendix. 



Proof of Theorem 4.3 and Corollary 4.4. The first step is to move from the actual sample size $n$ to the
effective sample size $\mu$, which depends on the $\beta$-mixing behavior. Let $a$ and $\mu$ be non-negative integers such
that $2a\mu + d \le n$. Now divide $Y_{1:n}$ into $2\mu$ blocks, each of length $a$, ignoring the remainder. Identify the
blocks as follows:

\[ U_j = \{Y_i : 2(j-1)a + 1 \le i \le (2j-1)a\}, \tag{33} \]
\[ V_j = \{Y_i : (2j-1)a + 1 \le i \le 2ja\}. \tag{34} \]



Let $U$ be the sequence of odd blocks $U_j$, and let $V$ be the sequence of even blocks $V_j$. Finally, let $U'$ be a
sequence of blocks which are mutually independent and such that each block has the same distribution as a
block from the original sequence. That is, construct $U'_j$ such that

\[ \mathcal{L}(U'_j) = \mathcal{L}(U_j) = \mathcal{L}(U_1), \tag{35} \]

where $\mathcal{L}(\cdot)$ means the probability law of the argument.
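
The blocking step can be sketched in a few lines of Python. This is only an illustration of the index bookkeeping for the odd and even blocks: the function name `split_blocks` is hypothetical, the memory length $d$ is not modelled, and the independent copies $U'$ are a proof device that cannot be constructed from data.

```python
def split_blocks(y, a, mu):
    """Split y into mu odd blocks U_j and mu even blocks V_j, each of length a.

    In 1-based indexing, U_j holds Y_i for 2(j-1)a + 1 <= i <= (2j-1)a and
    V_j holds Y_i for (2j-1)a + 1 <= i <= 2ja. Any remainder is ignored.
    """
    if a <= 0 or mu <= 0 or 2 * a * mu > len(y):
        raise ValueError("need positive a and mu with 2*a*mu <= len(y)")
    odd, even = [], []
    for j in range(1, mu + 1):
        start_u = 2 * (j - 1) * a   # 0-based start of U_j
        start_v = (2 * j - 1) * a   # 0-based start of V_j
        odd.append(y[start_u:start_u + a])
        even.append(y[start_v:start_v + a])
    return odd, even


y = list(range(1, 24))              # stands in for Y_1, ..., Y_23
U, V = split_blocks(y, a=3, mu=3)   # the last 5 observations are dropped
```

Here `U` is `[[1, 2, 3], [7, 8, 9], [13, 14, 15]]` and `V` is `[[4, 5, 6], [10, 11, 12], [16, 17, 18]]`: consecutive odd blocks are separated by a gap of length $a$, which is what lets the mixing coefficient control their dependence.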

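Let $\hat{R}_U(f)$, $\hat{R}_{U'}(f)$, and $\hat{R}_V(f)$ be the empirical risk of $f$ based on the block sequences $U$, $U'$, and $V$
respectively. Clearly $\hat{R}_n(f) = \frac{1}{2}(\hat{R}_U(f) + \hat{R}_V(f))$. Then,

\begin{align}
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_n(f)}{Q_n(f)} > \epsilon \right)
&= \mathbb{P}\left( \sup_{f \in \mathcal{F}} \left[ \frac{R_n(f) - \hat{R}_U(f)}{2Q_n(f)} + \frac{R_n(f) - \hat{R}_V(f)}{2Q_n(f)} \right] > \epsilon \right) \tag{36} \\
&\le \mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_U(f)}{Q_n(f)} + \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_V(f)}{Q_n(f)} > 2\epsilon \right) \tag{37} \\
&\le \mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_U(f)}{Q_n(f)} > \epsilon \right) + \mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_V(f)}{Q_n(f)} > \epsilon \right) \tag{38} \\
&= 2\,\mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_U(f)}{Q_n(f)} > \epsilon \right). \tag{39}
\end{align}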

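Now, apply Lemma 4.1 in Yu [56] (reproduced as Lemma A.1 in Section A) to the probability of the event
$\left\{ \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_U(f)}{Q_n(f)} > \epsilon \right\}$.
This allows us to move from statements about dependent blocks to statements about independent blocks
with a slight correction. Therefore we have

\[ \mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_n(f)}{Q_n(f)} > \epsilon \right) \le 2\,\mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_{U'}(f)}{Q_n(f)} > \epsilon \right) + 2(\mu - 1)\beta_{a-d}, \tag{40} \]

where the probability on the right is for the $\sigma$-field generated by the independent block sequence $U'$. Therefore,

\[ 2\,\mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_{U'}(f)}{Q_n(f)} > \epsilon \right) \le 8(2\mu + 1)^{\mathrm{vcd}(\mathcal{F})} \exp\left\{ -\frac{\mu \exp\left( W\left( -\frac{2\epsilon^2}{e^4} \right) + 4 \right)}{4} \right\}, \tag{41} \]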
where we have applied Theorem 7 in Cortes et al. [10] (reproduced as Lemma A.2) to bound the independent
blocks $U'$.

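To prove the corollary, combine (40) and (41), set the resulting right-hand side to $\eta$, take
$\eta' = \eta - 2(\mu - 1)\beta_{a-d}$, and solve for $\epsilon$.
We get that for all $f \in \mathcal{F}$, with probability at least $1 - \eta$,

\[ \frac{R_n(f) - \hat{R}_n(f)}{Q_n(f)} \le \epsilon. \tag{42} \]

Solving the equation

\[ \eta' = 8(2\mu + 1)^{\mathrm{vcd}(\mathcal{F})} \exp\left\{ -\frac{\mu \exp\left( W\left( -\frac{2\epsilon^2}{e^4} \right) + 4 \right)}{4} \right\} \tag{43} \]

implies

\[ \epsilon = \sqrt{\frac{\mathcal{E}(4 - \log\mathcal{E})}{2}} \tag{44} \]

with

\[ \mathcal{E} = \frac{4\,\mathrm{vcd}(\mathcal{F})\log(2\mu + 1) + \log(8/\eta')}{\mu}. \tag{45} \]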



The only obstacle to the use of Theorem 4.3 is knowledge of $\mathrm{vcd}(\mathcal{F})$. For some models, the VC dimension
can be calculated explicitly.

Theorem 4.5. For the class of AR($d$) models, $\mathcal{F}_{AR}(d)$,

\[ \mathrm{vcd}(\mathcal{F}_{AR}(d)) = d + 1. \tag{46} \]

For the class of VAR($d$) models with $k$ time series, $\mathcal{F}_{VAR}(k,d)$,

\[ \mathrm{vcd}(\mathcal{F}_{VAR}(k,d)) = kd + 1. \tag{47} \]

Theorem 4.5 applies equally to Bayesian VARs. However, this is likely too conservative as the prior tends 
to restrict the effective complexity of the function class. ^ 

4.3 Growing memory 

Most macroeconometric forecasting model classes have growing rather than fixed-length memories. These 
model classes include dynamic factor models, ARMA models, and linearized dynamic stochastic general 
equilibrium models. However, all of these models have the property that forecasts are linear functions of 
past observations, and, moreover, the weight placed on the past generally shrinks exponentially. These 
properties let us get bounds similar to our previous results. 

Any linear predictor with growing memory can be put in the following form ($1 \le d \le n$):

\[ \hat{Y}_{d+1:n+1} = B\,y_{1:n} \tag{48} \]



^Here we should mention that these risk bounds are frequentist in nature. We mean that if one treats Bayesian methods as
a regularization technique and predicts with the posterior mean or mode, then our results hold. However, from a subjective
Bayesian perspective, our results add nothing since all inference can be derived from the posterior. For further discussion of the
frequentist risk properties of Bayesian methods under mis-specification, see for example Kleijn and van der Vaart [29], Müller
[41] or Shalizi [47].
where

\[ B = \begin{pmatrix}
b_{d,1} & \cdots & b_{d,d} & & & \\
b_{d+1,1} & \cdots & b_{d+1,d} & b_{d+1,d+1} & & \\
\vdots & & & & \ddots & \\
b_{n,1} & \cdots & b_{n,d} & b_{n,d+1} & \cdots & b_{n,n}
\end{pmatrix}. \tag{49} \]

With this notation, we can prove the following about growing memory linear predictors. 

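Theorem 4.6. Suppose that Assumption A and Assumption B hold, and that the model class $\mathcal{F}$ is linear in
the data, with growing memory. Further assume that the loss function $\ell$ satisfies the following conditions:

1. for some $A > 0$, $\ell(y + y') \le A(\ell(y) + \ell(y'))$ (modified triangle inequality).

2. $\ell(yy') \le \ell(y)\ell(y')$ (sub-multiplication).

Given a time series of length $n$, fix some $1 \le d < n$, and let $\mu$ and $a$ be integers such that $2\mu a + d \le n$. Then
the following bound holds simultaneously for all $f \in \mathcal{F}$:

\[ \mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_n(f) - \delta_d(f)}{Q_n(f)} > \epsilon \right) \le 8(2\mu + 1)^{\mathrm{vcd}(\mathcal{F})} \exp\left\{ -\frac{\mu \exp\left( W\left( -\frac{2\epsilon^2}{e^4} \right) + 4 \right)}{4} \right\} + 2\mu\beta_{a-d}, \tag{50} \]

where

\[ \delta_d(f) = A^2\,\mathbb{E}[\ell(Y_1)] \sum_{j=1}^{n-d} \ell(b_{n,j}) + \frac{A}{n-d-1} \sum_{i=d+1}^{n-1} \ell\left( \sum_{j=1}^{i-d} b_{i,j}\,y_j \right). \tag{51} \]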

We should clarify the conditions on the loss function, and the role of the approximation term.

The assumptions on the loss function are quite mild. Both conditions are satisfied for any norm: the
triangle inequality holds with $A = 1$ (by the definition of "norm"), and sub-multiplication holds by the
Cauchy-Schwarz inequality. Thus the assumptions hold when, for instance, vector-valued predictions have
their accuracy measured using matrix norms. Likewise, absolute error loss ($\ell(y - y') = |y - y'|$) satisfies both
conditions with $A = 1$, while squared error loss satisfies the conditions with $A = 2$.
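
A quick numerical spot check of the two conditions for scalar losses can be written as follows. This is a randomized sanity check under assumed sampling ranges, not a proof; the helper names `leq` and `check_loss` are illustrative.

```python
import random


def leq(lhs, rhs):
    """lhs <= rhs up to a small relative slack for floating-point rounding."""
    return lhs <= rhs * (1.0 + 1e-12) + 1e-12


def check_loss(loss, A, trials=10000, seed=0):
    """Return True if both loss conditions hold on randomly drawn scalar pairs.

    Condition 1: loss(y + z) <= A * (loss(y) + loss(z)).
    Condition 2: loss(y * z) <= loss(y) * loss(z).
    """
    rng = random.Random(seed)
    for _ in range(trials):
        y, z = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if not leq(loss(y + z), A * (loss(y) + loss(z))):
            return False
        if not leq(loss(y * z), loss(y) * loss(z)):
            return False
    return True


def absolute(u):
    return abs(u)


def squared(u):
    return u * u


abs_ok = check_loss(absolute, A=1)        # absolute error with A = 1
sq_ok = check_loss(squared, A=2)          # squared error with A = 2
sq_fails_with_1 = not check_loss(squared, A=1)
```

The last line illustrates why squared error needs $A = 2$: with $A = 1$ the first condition fails whenever the two arguments share a sign.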

The $\delta_d(f)$ term arises from taking a fixed-memory approximation, of length $d$, to predictors with growing
memory. As will become clear in the proof, we make this approximation to apply the previous theorem, but
it involves a trade-off. As $d \nearrow n$, $\delta_d(f) \searrow 0$, but this drives $\mu \searrow 0$, resulting in fewer effective training
points, whereas smaller $d$ has the opposite effect. Also, $\delta_d(f)$ depends on $\mathbb{E}[\ell(Y_1)]$, which is not necessarily
desirable. However, Assumption B has the consequence that $\mathbb{E}[\ell(Y_1)] \le M < \infty$.

Corollary 4.7. Given a sample $Y_{1:n}$ such that Assumption A and Assumption B hold, suppose that the model
class $\mathcal{F}$ is linear in the data and has growing memory. Fix some $1 \le d < n$. Then, for any $f \in \mathcal{F}$, the
following bound holds with probability at least $1 - \eta$:

\[ R_n(f) \le \hat{R}_n(f) + \delta_d(f) + M\sqrt{\frac{\mathcal{E}(4 - \log\mathcal{E})}{2}}, \tag{52} \]

where $\mathcal{E}$ and $\eta'$ are as in Corollary 4.4.

To apply Theorem 4.6, we specialize to linear Gaussian state-space models, where we can calculate $\delta_d(f)$
directly, and demonstrate that it will behave well as $n$ grows. Such models are not, unfortunately, universal,
but all of the most common macroeconomic forecasting models (including dynamic factor models, ARMA
models, GARCH models, and even linearized DSGEs) have linear-Gaussian state-space representations.






The general specification of a linear Gaussian state-space model, $\mathcal{F}_{SS}$, is

\begin{align}
y_t &= Z\alpha_t + \epsilon_t, & \epsilon_t &\sim N(0, H), \nonumber \\
\alpha_{t+1} &= T\alpha_t + \eta_{t+1}, & \eta_t &\sim N(0, Q), \tag{53} \\
\alpha_1 &\sim N(a_1, P_1). \nonumber
\end{align}

We make no assumptions about the sizes of the parameter matrices $Z$, $T$, $H$, $Q$, $a_1$, or $P_1$, but we do require
stationarity. This amounts to forcing the eigenvalues of $T$ to lie inside the complex unit circle. Stationarity
ensures that $\delta_d(f)$ will be bounded as well as conforming to our assumptions about the data generating
process.

To forecast using $\mathcal{F}_{SS}$, one uses the Kalman filter (Durbin and Koopman [18], Kalman [27]). To estimate
the unknown parameter matrices, we either: (1) maximize the likelihood returned by the filter; or (2) use the
EM algorithm, alternating between running the Kalman filter (the E-step) and maximizing the conditional
likelihood by least squares (the M-step). (Bayesian estimation works like EM, replacing the M-step with
Bayesian updating.) Either way, one can show [18] that given the parameter matrices, the (maximum a
posteriori) forecast of $y_{t+1}$ is given by

\[ \hat{y}_{t+1} = Z \sum_{j=1}^{t-1} \left( \prod_{i=j+1}^{t} L_i \right) K_j\,y_j + Z K_t\,y_t + Z \left( \prod_{i=1}^{t} L_i \right) a_1, \tag{54} \]

where

\[ F_t = (Z P_t Z' + H)^{-1}, \quad K_t = T P_t Z' F_t, \quad L_t = T - K_t Z, \quad P_{t+1} = T P_t L_t' + Q. \tag{55} \]

This yields the form of $\delta_d(f)$ for linear state-space models. We therefore have the following corollary to
Theorem 4.6.

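Corollary 4.8. Let $\mathcal{F}$ correspond to a state-space model as in (53), and fix $1 \le d < n$. Then the following
bound holds simultaneously for all $f \in \mathcal{F}$: with probability at least $1 - \eta$,

\[ R_n(f) \le \hat{R}_n(f) + \delta_d(f) + M\sqrt{\frac{\mathcal{E}(4 - \log\mathcal{E})}{2}}, \tag{56} \]

where $\mathcal{E}$ is as in Corollary 4.4, and

\[ \delta_d(f) = A^2\,\mathbb{E}[\ell(Y_1)] \sum_{j=1}^{n-d} \ell\left( Z \prod_{i=j+1}^{n} L_i K_j \right) + \frac{A}{n-d-1} \sum_{t=d+1}^{n-1} \ell\left( \sum_{j=1}^{t-d} Z \prod_{i=j+1}^{t} L_i K_j\,y_j \right). \tag{57} \]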

It is simple to compute $\delta_d(f)$ using Kalman filter output, so the corollary lets us compute risk bounds
for common macroeconomic forecasting models.
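
To illustrate how the Kalman recursions determine the weight a forecast places on each past observation, here is a minimal scalar sketch in Python. It assumes a one-dimensional state and observation, reads the recursions as $F_t = 1/(Z P_t Z + H)$, $K_t = T P_t Z F_t$, $L_t = T - K_t Z$, $P_{t+1} = T P_t L_t + Q$, and takes the weight on $y_j$ in the forecast of $y_{t+1}$ to be $Z (L_{j+1} \cdots L_t) K_j$. The function name and the parameter values are hypothetical, chosen only for illustration.

```python
def forecast_weights(T, Z, H, Q, P1, t):
    """Weights that the one-step forecast of y_{t+1} places on y_1, ..., y_t.

    Scalar state-space model; the returned list has the weight on y_1 first
    and the weight on y_t (which equals Z * K_t) last.
    """
    P = P1
    K, L = [], []
    for _ in range(t):
        F = 1.0 / (Z * P * Z + H)   # inverse innovation variance
        k = T * P * Z * F           # Kalman gain K_t
        l = T - k * Z               # L_t
        K.append(k)
        L.append(l)
        P = T * P * l + Q           # next state variance P_{t+1}
    weights = []
    for j in range(t):              # j is the 0-based index of y_{j+1}
        prod = 1.0
        for i in range(j + 1, t):   # L_{j+2}, ..., L_t in 1-based terms
            prod *= L[i]
        weights.append(Z * prod * K[j])
    return weights


# Hypothetical stationary example: |T| < 1.
w = forecast_weights(T=0.9, Z=1.0, H=1.0, Q=0.1, P1=1.0, t=50)
```

With $|T| < 1$ the weights shrink rapidly as observations recede into the past, which is the property that keeps the truncation term $\delta_d(f)$ small for moderate $d$.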



5 Bounds in practice 

We now show how the theorems of the previous section can be used both to quantify prediction risk and to 
select models. We first estimate a simple stochastic volatility model using IBM return data and calculate 
the bound for the predicted volatility using Corollary 4.8. Then we show how the same methods can be used 
for typical macroeconomic forecasting models. 
Figure 2: This figure plots daily volatility (squared log returns) for IBM from 1962-2011.
5.1 Stochastic volatility model 

We estimate a standard stochastic volatility model using daily log returns for IBM from January 1962 until 
October 2011 — n = 12541 observations. Figure 2 shows the squared log-return series. 
The model we investigate is

\[ y_t = \sigma z_t \exp(\rho_t/2), \qquad z_t \sim N(0, 1), \tag{58} \]
\[ \rho_{t+1} = \phi\rho_t + w_t, \qquad w_t \sim N(0, \sigma_\rho^2), \tag{59} \]

where the disturbances $z_t$ and $w_t$ are mutually and serially independent. Following Harvey et al. [24], we
linearize this non-linear model as follows:

\[ \log y_t^2 = \kappa + \rho_t + \xi_t, \tag{60} \]
\[ \xi_t = \log z_t^2 - \mathbb{E}[\log z_t^2], \tag{61} \]
\[ \kappa = \log\sigma^2 + \mathbb{E}[\log z_t^2]. \tag{62} \]

The noise term is no longer Gaussian, but the Kalman filter will still give the minimum-mean-squared-error
linear estimate of the variance sequence $\rho_{1:n+1}$. The observation variance is now $\pi^2/2$.

To match the data to the model, let $y_t$ be the log returns and remove 688 observations where the return
was 0 (i.e., the price did not change from one day to the next). Using the Kalman filter, the negative log
likelihood is given by

\[ \mathcal{L}(Y_{1:n} \mid \kappa, \phi, \sigma_\rho^2) \propto \sum_{t=1}^{n} \left( \log F_t + v_t^2 F_t^{-1} \right). \tag{63} \]

Minimizing this gives estimates $\hat\kappa = -9.62$, $\hat\phi = 0.996$, and $\hat\sigma_\rho^2 = 0.003$. Taking the loss to be
$\ell(y, y') = (y - y')^2$ gives training error $\hat{R}_n(f) = 3.333$.

To actually calculate the bound, we need a few more values. First, using the methods in McDonald
et al. [34, 35], we can estimate $\hat\beta_8 = 0.017$. For $a > 8$, the optimal point estimate of $\beta_a$ is 0. While this is
presumably an underestimate, we will take $\beta_a = 0$ for $a > 8$. For the upper bound in Assumption B, we use
$M = \sqrt{2}$.

Combining these values with the VC dimension for the stochastic volatility model, we can bound the
prediction risk. For $d = 2$, the VC dimension can be no larger than 3. Finally, taking $\mu = 538$, $a = 11$,
$d = 2$, and $\mathbb{E}[Y_1^2] = 1$, we get that $\delta_2(f) = 0.60 + 2.13 = 2.73$. The result is the bound

\[ R_n(f) \le 7.04 \tag{64} \]

with probability at least 0.85. In other words, the bound is much larger than the training error, but this
is to be expected: the data are highly dependent, so the large $n$ translates into a relatively small effective
sample size $\mu$.
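
The arithmetic behind this bound can be checked with a short Python sketch. It assumes the penalty has the form $M\sqrt{\mathcal{E}(4 - \log\mathcal{E})/2}$ with $\mathcal{E} = (4\,\mathrm{vcd}\,\log(2\mu+1) + \log(8/\eta'))/\mu$, that $\eta' = \eta = 0.15$ because the relevant mixing coefficient is taken to be zero, and that $M = \sqrt{2}$; the function name `risk_bound` is illustrative.

```python
import math


def risk_bound(train_err, delta, vcd, mu, eta_prime, M):
    """Training error + approximation term + M * sqrt(E * (4 - log E) / 2)."""
    E = (4.0 * vcd * math.log(2 * mu + 1) + math.log(8.0 / eta_prime)) / mu
    penalty = M * math.sqrt(E * (4.0 - math.log(E)) / 2.0)
    return train_err + delta + penalty


# Values quoted in the text for the stochastic volatility model.
sv_bound = risk_bound(train_err=3.333, delta=2.73, vcd=3, mu=538,
                      eta_prime=0.15, M=math.sqrt(2.0))
print(round(sv_bound, 2))
```

Under these assumptions the sketch gives a value that rounds to 7.04, in agreement with (64); most of the gap between the bound and the training error comes from the approximation term rather than the complexity penalty.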






Table 1: This table shows the training error and risk bounds for 3 models. AIC is given as the difference
from predicting with the global mean (the smaller the value, the more support for that model).

Model    Training error    AIC-Baseline    Risk bound ($1 - \eta > 0.85$)
SV       3.33              -2816           7.04
AR(2)    3.54              -348            4.52
Mean     3.65                              4.29




Figure 3: Time series used to estimate the RBC model. These are quarterly data from 1948:1 until 2010:1. 
The blue line is GDP (output), the red line is consumption, the green line is investment, and the orange line 
is hours worked. These data are plotted as percentage deviations from trend as discussed in Section C. 



For comparison, we also computed the bound for forecasts produced with an AR(2) model (with intercept)
and with the global mean alone. In the case of the mean, we take $\mu = 658$ and $a = 9$ since in this case, $d = 0$.
The results are shown in Table 1. The stochastic volatility model reduces the training error by 5% relative
to predicting with the mean, an improvement which is marginal at best. But the resulting risk bound clearly
demonstrates that given the small effective sample size, even this gain may be spurious: it is likely that the
stochastic volatility model is simply over-fitting.

5.2 Real business cycle model 

In this section, we will discuss the methodology for applying risk bounds to the forecasts generated by the
real business cycle (RBC) model. This is a standard tool in macroeconomic forecasting. For a discussion of
the RBC model and the standard methods used to bring the model to the data, see, for example, DeJong
and Dave [13], DeJong et al. [14], Fernández-Villaverde [20], Kydland and Prescott [30], Romer [46], Sims
[49], Smets and Wouters [50].

To estimate the parameters of this model, we use four data series. These are GDP $y_t$, consumption $c_t$,
investment $i_t$, and hours worked $n_t$. (The data are from the Federal Reserve Economic Database, FRED.) The
series we use are shown in Figure 3.

The basic idea of the estimation is to transform the model from an inter-temporal optimization form into 
a state space model. This leads to a linear, Gaussian state-space model with four observed variables (listed 
above), and two unobserved state variables. The mapping from parameters of the optimization problem to 
parameters of the state-space model is nonlinear, but, for each parameter setting, the Kalman filter returns
the likelihood, so that likelihood methods are possible. As the data are uninformative about many of the 
parameters, we estimate by maximizing a penalized likelihood, rather than a simple likelihood. Then the 
Kalman filter produces in-sample forecasts which are linear in past values of the data, so that we could 
potentially apply the growing memory bound. 

For macroeconomic time series, there is not enough data to give nontrivial bounds, regardless of the
mixing coefficients or the size of the finite memory approximation. Figure 3 shows $n = 249$ observations.
The minimal possible finite approximation model is a VAR with one lag and four time series, which, by
Theorem 4.5, has VC dimension 5. In this case, since we are dealing with vector valued forecasts, we take
$\ell(y - y') = \|y - y'\|_2$. We assume that Assumption B is satisfied with $M = 0.1$ and demand confidence
0.85 ($\eta = 0.15$).

Again, using the methods of McDonald et al. [34, 35], we can estimate the $\beta$-mixing coefficients of the
macroeconomic data set. The result is a point estimate f3i = 0. Assuming that this is approximately 
accurate (0 is of course an underestimate), this suggests that the effective size of the macroeconomic data 
set is no more than about $\mu = 31$, much smaller than $n = 249$. To calculate the bound, we assume that
$\mathbb{E}[\|Y\|_2] < 0.1$. Since the loss function is a norm, $A = 1$. The training error of the fitted RBC model
is $\hat{R}_n(f) = 0.00059$. Thus our bound is given by

\[ R_n(f) \le \hat{R}_n(f) + \delta_d(f) + \text{penalty} = 0.00059 + 0.18 + 3.07 = 3.26. \tag{65} \]

The bound here is four orders of magnitude larger than the training error. If the bound is tight, then this 
suggests that the training error severely underestimates the true prediction risk. Of course, this should not 
be too surprising since the RBC model has 11 parameters and we are trying to get confidence intervals using 
only 31 effective data points. 

In some sense, the empirical results in this section seem slightly unreasonable. Since the results are only 
upper bounds, it is important to get an idea as to how tight they may be. We address this issue in the next 
section. 



6 Properties of our results 

In the previous section, we showed that the upper bound for the risk of standard macroeconomic forecasting 
models may be large. This of course raises the question "How tight are these bounds?" We address this 
question next and then discuss how to use the bounds for model selection. 

6.1 How tight are the bounds? 

Here we give some idea of how tight the bounds presented in Section 4 are. Let $\hat{f}_{erm}$ be the function that
minimizes the training error (or penalized training error) over $\mathcal{F}$, and let

\[ f^* = \operatorname*{argmin}_{f \in \mathcal{F}} R_n(f) \tag{66} \]

be the minimizer of the true risk ("pseudo-truth"), i.e. the best-predicting function in $\mathcal{F}$. We call

\[ L_n(\Pi) := \inf_{P \in \Pi} \mathbb{E}_P[R_n(\hat{f}_{erm}) - R_n(f^*)] = \inf_{P \in \Pi} \mathbb{E}_P[R_n(\hat{f}_{erm})] - R_n(f^*) \tag{67} \]

the "oracle loss"; it describes how well empirical risk minimization works relative to the best possible predictor
$f^*$ over the worst distribution $P$. Vapnik [52] shows that for classification and IID data, for sufficiently large
$n$, there exist constants $c$ and $C$ such that

\[ c\sqrt{\frac{\mathrm{vcd}(\mathcal{F})}{n}} \le L_n(\Pi) \le C\sqrt{\frac{\mathrm{vcd}(\mathcal{F})\log n}{n}}, \tag{68} \]

where $\Pi$ is the class of all distributions satisfying $\mathbb{P}(\ell(y - y') > K) = 0$. In other words, for IID data,
the best we can hope to do is a rate of $O\left(\sqrt{\mathrm{vcd}(\mathcal{F})/n}\right)$, and prediction methods which perform worse than
$O\left(\sqrt{\mathrm{vcd}(\mathcal{F})\log n / n}\right)$ are inefficient. We will derive similar bounds for the mixing setting. First, we need a
slightly different version of Theorem 4.3.

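Theorem 6.1. Suppose that $\ell(y - y') < K$, that Assumption A holds, and that $\mathcal{F}$ has a fixed memory length
$d < n$. Let $\mu$ and $a$ be integers such that $2\mu a + d \le n$. Then, for all $\epsilon > 0$,

\[ \mathbb{P}\left( \sup_{f \in \mathcal{F}} |R_n(f) - \hat{R}_n(f)| > \epsilon \right) \le 8(2\mu + 1)^{\mathrm{vcd}(\mathcal{F})} \exp\left\{ -\frac{\mu\epsilon^2}{K_1^2} \right\} + 2\mu\beta_{a-d}, \tag{69} \]

where $K_1$ depends only on $K$.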

The proof of Theorem 6.1 is exactly like that for Theorem 4.3.

Assumption C. The time series $Y_\infty$ is exponentially (or geometrically) $\beta$-mixing, i.e.

\[ \beta_a = c_1 \exp(-c_2 a^{\kappa}) \tag{70} \]

for some constants $c_1$, $c_2$, $\kappa$.

Theorem 6.2. Suppose $\ell(y - y') < K$ and that Assumption C holds. Then, for sufficiently large $n$, there
exist constants $c$ and $C$, independent of $n$ and $\mathrm{vcd}(\mathcal{F})$, such that



n 

Proof. Theorem 6.1 implies that simultaneously 



and 



iRniferrn) - Rn{ferm)\ > ej (72) 

< 8(2m + 1)™°^^^ exp |-^| + 2(a. - l)Pa-d 

¥ (\R„{f*) - Rn{f*)\ > e) (73) 

< 8(2/i + 1)«^'°(^) exp |-^| + 2(^ - 
Since Rn(ferm) - Rn{f*) < 0, then 

P{\Rn{ferm)-Rnif*)\>2e^ (74) 

< 8(2m + ir^'^^^ exp I + 2(^ - l)/3„_,. 

Letting Z — \Rniferm) — Rnif*)\, ^1 — 8(2^ + and k2 — l/Kf and ignoring constants, 

E[Z^]<s + k[J e-fe^^"'de + 4y finPa^-dde (75) 



POO pK 

Ln{U) <s + k[ / e-'^^^-^de + 4 / , 

Js Jo 



Hnfia„-dde (76) 



= S+ ^- + ksflnPa^-d- (77) 

k2l^n 

Using Assumption C, take a„ = n^^^^'^'^\ /x„ = n'^^''^+'^\ and s ~ ^J°f+i)i. to balance the exponential and 
linear terms. Then, 

/vcd(J") logn 

For the lower bound, apply the IID version 
If we instead assume algebraic mixing, i.e. $\beta_a = c_1 a^{-r}$, then we can retrieve the same rate where
$0 < \kappa < (r - 1)/2$ (see Meir [37]). Theorem 6.2 says that in dependent data settings, using the blocking
approach developed here, we may pay a penalty: the upper bound on $L_n(\Pi)$ goes to zero more slowly than
in the IID case. But the lower bound cannot be made any tighter since IID processes are still allowed under
Assumption C (and of course under the more general Assumption A). In other words, we may have $\kappa \to \infty$,
so we cannot rule out the faster learning rate of $O\left(\sqrt{\mathrm{vcd}(\mathcal{F})/n}\right)$.



6.2 Structural risk minimization 

Our presentation so far has focused on choosing one function $f$ from a model $\mathcal{F}$ and demonstrating that
the prediction risk $R_n(f)$ is well characterized by the training error inflated by a complexity term. The
procedure for actually choosing $f$ has been ignored. Common ways of choosing $f$ are frequently referred
to as empirical risk minimization or ERM: approximate the expected risk $R_n(f)$ with the empirical risk
$\hat{R}_n(f)$, and choose $f$ to minimize the empirical risk. Many likelihood based methods have exactly this
flavor. But more frequently, forecasters have many different models in mind, each with a different empirical
risk minimizer. Regularized model classes (ridge regression, lasso, Bayesian methods) implicitly have this
structure: altering the amount of regularization leads to different models $\mathcal{F}$. Or one may have many
different forecasting models from which the forecaster would like to choose the best. This scenario leads to
a generalization of ERM called structural risk minimization or SRM.

Given a collection of models $\mathcal{F}_1, \mathcal{F}_2, \ldots$, each with associated empirical risk minimizers $\hat{f}_1, \hat{f}_2, \ldots$, we wish
to use the function which has the smallest risk. Of course different models have different complexities, and
those with larger complexities will tend to have smaller empirical risk. To choose the best function, we
therefore penalize the empirical risk and select that function which minimizes the penalized version. Model
selection tools like AIC or BIC have exactly this form, but they rely on specific knowledge of the data
likelihood and use asymptotics to derive approximate penalties. In contrast, we have finite-sample bounds
for the expected risk. This leads to a natural model selection rule: choose the predictor which has the
smallest bound on the expected risk.

The generalization error bounds in Section 4 allow one to perform model selection via the SRM principle
without knowledge of the likelihood or appeals to asymptotic results. The penalty accounts for the complexity
of the model through the VC dimension. Most useful however is that by using generalization error bounds
for model selection, we are minimizing the prediction risk. So in the volatility forecasting exercise above, we
would choose the mean.
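
The selection rule can be made concrete with the numbers reported in Table 1. The Python sketch below simply picks the model with the smallest value; the helper name `select_smallest` is illustrative.

```python
# Values reported in Table 1 (risk bounds hold with confidence at least 0.85).
training_errors = {"SV": 3.33, "AR(2)": 3.54, "Mean": 3.65}
risk_bounds = {"SV": 7.04, "AR(2)": 4.52, "Mean": 4.29}


def select_smallest(scores):
    """Return the model name whose score (training error or bound) is smallest."""
    return min(scores, key=scores.get)


erm_choice = select_smallest(training_errors)   # ranks by training error alone
srm_choice = select_smallest(risk_bounds)       # ranks by the risk bound
```

Ranking by training error alone favors the stochastic volatility model, while ranking by the risk bound favors the global mean, which is the choice described above.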

If we want to make the prediction risk as small as possible, we can minimize the generalization error
bound simultaneously over models $\mathcal{F}$ and functions within those models. This amounts to treating VC
dimension as a control variable. Therefore, by minimizing both the empirical risk and the VC dimension,
we can choose that model and function which has the smallest prediction risk, a claim which other model
selection procedures cannot make [32, 53].

7 Conclusion 

This paper demonstrates how to control the generalization error of common macroeconomic forecasting
models: ARMA models, vector autoregressions (Bayesian or otherwise), linearized dynamic stochastic
general equilibrium models, and linear state-space models. We derive upper bounds on the risk, which hold
with high probability while requiring only weak assumptions on the data-generating process. These bounds
are finite sample in nature, unlike standard model selection penalties such as AIC or BIC. Furthermore,
they do not suffer the biases inherent in other risk estimation techniques such as the pseudo-cross validation
approach often used in the economic forecasting literature.

While we have stated these results in terms of standard economic forecasting models, they have very 
wide applicability. Theorem 4.3 applies to any forecasting procedure with fixed memory length, linear or 
non-linear. Theorem 4.6 applies only to methods whose forecasts are linear in the observations, but a similar 
result for nonlinear methods would just need to ensure that the dependence of the forecast on the past decays 
in some suitable way. 



Rather than deriving bounds theoretically, one could attempt to estimate bounds on the risk. While
cross-validation is tricky [44], nonparametric bootstrap procedures may do better. A fully nonparametric 
version is possible, using the circular bootstrap (reviewed in [31]). Bootstrapping lengthy out-of-sample 
sequences for testing fitted model predictions yields intuitively sensible estimates of Rn{f), but there is 
currently no theory about the coverage level. Also, while models like VARs can be fit quickly to simulated 
data, general state-space models, let alone DSGEs, require large amounts of computational power, which is 
an obstacle to any resampling method. 

While our results are a crucial first step for the learning-theoretic analysis of time series forecasts, many 
avenues remain for future exploration. To gain a more complete picture of the performance of forecasting 
algorithms, we would want minimax lower bounds (cf. [51]). These would tell us the smallest risk we could 
hope to achieve using any forecaster in some larger model class, letting us ask whether any of the models 
in common use actually approach this minimum. Another possibility is to target not the ex ante risk of the 
forecast, but the ex post regret: how much better might our forecasts have been, in retrospect and on the 
actually-realized data, had we used a different prediction function from the model [8, 45]? Remarkably, we 
can find forecasters which have low ex post regret, even if the data came from an adversary trying to make 
us perform badly. If we target regret rather than risk, we can actually ignore mixing, and even stationarity 
[48]. 

An increased recognition of the abilities and benefits of statistical learning theory can be of tremendous 
aid to financial and economic forecasters. The results presented here represent an initial yet productive 
foray in this direction. They allow for principled model comparisons as well as high probability performance 
guarantees. Future work in this direction will only serve to sharpen our ability to measure predictive power. 

A Auxiliary results 

Lemma A.1 (Lemma 4.1 in [56]). Let $Z$ be an event with respect to the block sequence $U$. Then,

\[ |\mathbb{P}(Z) - \tilde{\mathbb{P}}(Z)| \le \beta_a(\mu - 1), \tag{79} \]

where the first probability is with respect to the dependent block sequence, $U$, and $\tilde{\mathbb{P}}$ is with respect to the
independent sequence, $U'$.

This lemma essentially gives a method of applying IID results to $\beta$-mixing data. Because the dependence
decays as we increase the separation between blocks, widely spaced blocks are nearly independent
of each other. In particular, the difference between expectations over these nearly independent blocks and
expectations over blocks which are actually independent can be controlled by the $\beta$-mixing coefficient.

Lemma A. 2 (Theorem 7 in Cortes et al. [10]). Under Assumption B, 



sup 



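Corollary A.3.

\[ \mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \hat{R}_n(f)}{Q_n(f)} > \epsilon \right) \le 4(2n + 1)^{\mathrm{vcd}(\mathcal{F})} \exp\left\{ -\frac{n \exp\left( W\left( -\frac{2\epsilon^2}{e^4} \right) + 4 \right)}{4} \right\}. \tag{81} \]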



B Proofs of selected results 

Proof of Theorem 4.5. The VC dimension of a linear classifier $f : \mathbb{R}^d \to \{0, 1\}$ is $d$ (cf. Vapnik [53]). Real
valued predictions have an extra degree of freedom.

For the VAR case, we are interested in the VC dimension of a multivariate linear classifier. Thus, one
must be able to shatter collections of vectors where each vector is a binary sequence of length $k$. For a
VAR, each coordinate is independent, thus, one can shatter a collection of vectors if one can shatter each
coordinate projection. The result then follows from the AR case. ■

Proof of Theorem 4.6 and Corollary 4.7. Let $\mathcal{F}$ be indexed by the parameters of the growing memory model.
Let $\mathcal{F}'$ be the same class of models, but with predictions made based on the truncated memory length $d$. Define
$\hat{R}_n(f')$ to be the training error of this truncated predictor $f'$. Then, for any $f \in \mathcal{F}$ and $f' \in \mathcal{F}'$,

\[ R_n(f) - \hat{R}_n(f) \le (R_n(f) - R_n(f')) + (R_n(f') - \hat{R}_n(f')) + (\hat{R}_n(f') - \hat{R}_n(f)). \tag{82} \]

We will need to handle all three terms. The first and third terms are similar. Let $B$ be as above and define
the truncated linear predictor to have the same form but with $B$ replaced by

\[ B' = \begin{pmatrix}
b_{d,1} & b_{d,2} & \cdots & b_{d,d} & & & \\
 & b_{d+1,2} & \cdots & b_{d+1,d} & b_{d+1,d+1} & & \\
 & & \ddots & & & \ddots & \\
 & & & b_{n,n-d+1} & & \cdots & b_{n,n}
\end{pmatrix}. \]



Then notice that 

Rnif) -Rnif) < IMf) - Rn{f)\ 



< 



n — a — 
A 

n — d 



^ n— 1 ^ n— 1 



i—d-\-l:i , 



i—d 
n-1 



i—d 



-^Y.i{{h,-h[)Y,.d+i:) 



i—d 



by the triangle inequality where is the i row of B and analogously for b'^ . Therefore 



Rnif) -Rnif) < 



A 

n — d - 

A 

n — d - 



(83) 

(84) 
(85) 

(86) 
(87) 



i—d 
71 — 1 {i — d 



1 E M E ^--lyi 



For the case of the expected risk, we need only consider the first rows of B and B'. Using linearity of 
expectations and stationarity 



Rnif) - Rnif) = miYn+l - KY,.,^+,)] - E[i (Y,,+ , - KVi-.n+l 
< AE[£((b„-b;)yi:„+i)] = AE 



n—d 



n—d 



(n-d \ 

< A2 ^ E[£ (fe„,r,)] <A'Y.^ (^»^) (^^■)] 

71 — d 

<A'E[eiY,)]J2Hb^,) 



Then, 
where 



$R_n(f) - \widehat{R}_n(f) - \delta_d(f) \le R_n(f') - \widehat{R}_n(f')$



i — d 



i — d 



ddif) = A-'E[eiY,)]J2Hb^ 



n-d 



"Y E ^ E ^'-lyo 



(89) 
(90) 

(91) 

(92) 

(93) 
(94) 
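Divide through by $Q_n(f)$ and take the supremum over $\mathcal{F}$ and $\mathcal{F}'$:

$$\sup_{f \in \mathcal{F}} \frac{R_n(f) - \widehat{R}_n(f) - \delta_d(f)}{Q_n(f)} \le \sup_{f \in \mathcal{F},\, f' \in \mathcal{F}'} \frac{R_n(f') - \widehat{R}_n(f')}{Q_n(f)}. \tag{95}$$

Finally,

$$\sup \frac{Q_n(f')}{Q_n(f)} \le 1 \tag{96}$$

since $\mathcal{F}' \subseteq \mathcal{F}$. So,

$$\sup \frac{R_n(f') - \widehat{R}_n(f')}{Q_n(f)} = \sup \frac{R_n(f') - \widehat{R}_n(f')}{Q_n(f')} \, \frac{Q_n(f')}{Q_n(f)} \tag{97}$$

$$\le \sup_{f' \in \mathcal{F}'} \frac{R_n(f') - \widehat{R}_n(f')}{Q_n(f')}. \tag{98}$$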



Now, 



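$$\mathbb{P}\left( \sup_{f \in \mathcal{F}} \frac{R_n(f) - \widehat{R}_n(f) - \delta_d(f)}{Q_n(f)} > \epsilon \right) \le \mathbb{P}\left( \sup_{f' \in \mathcal{F}'} \frac{R_n(f') - \widehat{R}_n(f')}{Q_n(f')} > \epsilon \right).$$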

Since $\mathcal{F}'$ is a class with finite memory, we can apply Theorem 4.3 and Corollary 4.4 to get the results. ■
Proof of Corollary 4.8. This follows immediately from Corollary 4.7 and (54). ■

C Data 

The data to estimate the RBC model is publicly available from the Federal Reserve Economic Database,
FRED (http://research.stlouisfed.org/fred2/). The necessary series are shown in Table 2. All of
the data is quarterly. The required series are PCESVC96, PCNDGC96, GDPIC1, HOANBS, and CNP16OV.
These five series are used to create four series $[y'_t, c'_t, i'_t, h'_t]$ as follows:

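$$c'_t = 2.5 \times 10^{5}\, \frac{\text{PCESVC96} + \text{PCNDGC96}}{\text{CNP16OV}} \tag{100}$$

$$i'_t = 2.5 \times 10^{5}\, \frac{\text{GDPIC1}}{\text{CNP16OV}} \tag{101}$$

$$y'_t = c'_t + i'_t \tag{102}$$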

'■--^^ 

We use the preprocessed data which accompanies DeJong and Dave [13] (http://www.pitt.edu/~dejong/seconded.htm).
We then apply the HP-filter described in Hodrick and Prescott [25] to each series individually
to calculate trend components $[\tilde{y}_t, \tilde{c}_t, \tilde{i}_t, \tilde{h}_t]$. The HP-filter amounts to fitting the smoothing spline

$$\tilde{x}_{1:n} = \operatorname*{argmin}_{z_{1:n}} \sum_{t=1}^{n} (x'_t - z_t)^2 + \lambda \sum_{t=2}^{n-1} \big( (z_{t+1} - z_t) - (z_t - z_{t-1}) \big)^2, \tag{104}$$

with the convention $\lambda = 1600$. We then calculate the detrended series that will be fed into the RBC model
as

$$x_t = \log x'_t - \log \tilde{x}_t. \tag{105}$$

The result is shown in Figure 3.
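The detrending step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper: it assumes the trend is obtained by solving the first-order condition $(I + \lambda D^\top D) z = x$ of the penalized least-squares problem, where $D$ is the second-difference operator, and it uses a dense Gaussian elimination that is adequate only for short series. The function names `hp_trend` and `detrend` are hypothetical.

```python
import math


def hp_trend(x, lam=1600.0):
    """Trend z minimizing sum_t (x_t - z_t)^2 + lam * sum_t (second difference of z)^2."""
    n = len(x)
    # A = I + lam * D'D, with D the (n - 2) x n second-difference operator.
    A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for r in range(n - 2):
        idx, coef = (r, r + 1, r + 2), (1.0, -2.0, 1.0)
        for a, ca in zip(idx, coef):
            for b, cb in zip(idx, coef):
                A[a][b] += lam * ca * cb
    # Solve A z = x by Gaussian elimination with partial pivoting.
    z = [float(v) for v in x]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        z[k], z[p] = z[p], z[k]
        for r in range(k + 1, n):
            m = A[r][k] / A[k][k]
            if m != 0.0:
                for c in range(k, n):
                    A[r][c] -= m * A[k][c]
                z[r] -= m * z[k]
    # Back substitution.
    for k in range(n - 1, -1, -1):
        s = sum(A[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (z[k] - s) / A[k][k]
    return z


def detrend(x, lam=1600.0):
    """Log deviation of a strictly positive series from its HP trend."""
    trend = hp_trend(x, lam)
    return [math.log(a) - math.log(b) for a, b in zip(x, trend)]
```

A series that is exactly linear in $t$ has zero second differences, so the filter returns it unchanged and the detrended series is zero; this gives a quick sanity check of any implementation.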
Table 2: Data series from FRED

Series ID   Description                                                Unit                         Availability
PCESVC96    Real Personal Consumption Expenditures: Services           Billions of Chained 2005 $   1/1/1995
PCNDGC96    Real Personal Consumption Expenditures: Nondurable Goods   Billions of Chained 2005 $   1/1/1995
GDPIC1      Real Gross Domestic Investment                             Billions of Chained 2005 $   1/1/1947
HOANBS      Nonfarm Business Sector: Hours of All Persons              Index: 2005=100              1/1/1947
CNP16OV     Civilian Noninstitutional Population                       Thousands of Persons         1/1/1948



Table 3: Priors, constraints, and parameter estimates for the RBC model.

                          Prior                     Constraint
Parameter   Estimate      Mean        Variance      Lower   Upper
$\alpha$    0.24          0.29        2.5×10^-2     0.1     0.5
            0.99          0.99        1.25×10^-3    0.90    1
            4.03          1.5         2.5           1       5
            0.13          0.6         0.1                   1
$\delta$    0.03 2        2.5×10^-2   1×10^-3               0.2
$\rho$      0.89          0.95        2.5×10^-2     0.80    1
            3.45×10^-5    1x10--^     2×10^-5               0.05
$\sigma_y$  1.02x10-*^                                      1
            2.30×10^-5                                      1
            e.iixio-'^                                      1
            1.68x10-"^
D Estimation

To estimate the model, we maximize the likelihood returned by the Kalman filter, penalized by priors on each of
the "deep" parameters. We do this because the likelihood surface is very rough and some prior
information about the parameters exists. Additionally, each of the parameters is constrained to lie in a plausible
interval. Each parameter has a normal prior with mean and variance similar to those in the literature.
We generally follow those in DeJong et al. [14]. The priors, constraints (which are strict), and estimates are
shown in Table 3.
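The shape of this objective can be sketched as follows. The snippet is an illustration under stated assumptions, not the estimation code behind Table 3: `log_likelihood` is a hypothetical callable standing in for the Kalman-filter log-likelihood of the model, the priors are treated as independent normals, and "strict" constraints are interpreted as assigning $-\infty$ to any parameter vector outside its box. The numbers in the usage example are illustrative only.

```python
import math


def penalized_log_likelihood(theta, log_likelihood, prior_mean, prior_var, lower, upper):
    """Log-likelihood plus independent normal log-priors, with hard box constraints.

    log_likelihood is a user-supplied (hypothetical) function of theta, for
    example the value returned by a Kalman filter run at those parameters.
    """
    # Hard constraints: anything outside [lower, upper] is infeasible.
    for p, lo, hi in zip(theta, lower, upper):
        if not (lo <= p <= hi):
            return -math.inf
    # Normal log-density of each parameter under its prior.
    log_prior = sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (p - m) ** 2 / v
        for p, m, v in zip(theta, prior_mean, prior_var)
    )
    return log_likelihood(theta) + log_prior


# Usage with a flat stand-in likelihood and a single illustrative parameter.
value = penalized_log_likelihood(
    theta=[0.29],
    log_likelihood=lambda th: 0.0,
    prior_mean=[0.29],
    prior_var=[0.025],
    lower=[0.1],
    upper=[0.5],
)
```

Maximizing this function with any bounded optimizer gives a penalized maximum-likelihood estimate; the prior term smooths a rough likelihood surface by pulling the search toward plausible parameter values.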